Nutch: an Open-Source Platform for Web Search
Author
Abstract
Nutch is an open-source project providing both complete Web search software and a platform for the development of novel Web search methods. Nutch is built on a distributed storage and computing foundation, such that every operation scales to very large collections. Core algorithms crawl, parse and index Web-based data. Plugins extend functionality at various points, including network protocols, document formats, indexing schemas and query operators.
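As a rough illustration of the parse and index stages and of the plugin idea described above, the following minimal sketch registers parser functions per document format and builds a small inverted index. It is not Nutch's actual API: the names `PARSERS`, `register_parser`, `build_index`, and `PAGES` are hypothetical, the example URLs are placeholders, and in-memory strings stand in for pages that a crawler would have fetched.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Set, Tuple

# Hypothetical plugin registry: maps a content type to a parser function.
PARSERS: Dict[str, Callable[[str], str]] = {}


def register_parser(content_type: str):
    """Decorator that registers a parser plugin for one document format."""
    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        PARSERS[content_type] = func
        return func
    return decorator


@register_parser("text/plain")
def parse_plain(raw: str) -> str:
    # Plain text needs no processing.
    return raw


@register_parser("text/html")
def parse_html(raw: str) -> str:
    # Crude tag stripping, enough for this illustration only.
    return re.sub(r"<[^>]+>", " ", raw)


def build_index(pages: Dict[str, Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Parse each page with its registered plugin and build term -> URLs."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for url, (content_type, raw) in pages.items():
        parser = PARSERS.get(content_type)
        if parser is None:
            continue  # no plugin registered for this format; skip the page
        for term in parser(raw).lower().split():
            index[term].add(url)
    return dict(index)


# Placeholder pages standing in for fetched content (assumed data).
PAGES = {
    "http://example.org/a": ("text/html", "<p>Open source search</p>"),
    "http://example.org/b": ("text/plain", "search platform"),
    "http://example.org/c": ("application/pdf", "%PDF-..."),
}
INDEX = build_index(PAGES)
```

Supporting a new document format in this sketch means registering one more function, without touching `build_index`; that is the general shape of an extension point, though the real system defines its own interfaces.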
Similar resources
Implementation of MapReduce Algorithm and Nutch Distributed File System in Nutch
This paper provides an in-depth description of the MapReduce algorithm and the Nutch Distributed File System in the Nutch web search engine. Nutch is an open-source Web search engine that can be used at global, local, and even personal scale. Engineering a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. The...
Nutch: A Flexible and Scalable Open-Source Web Search Engine
Nutch is an open-source Web search engine that can be used at global, local, and even personal scale. Its initial design goal was to enable a transparent alternative for global Web search in the public interest — one of its signature features is the ability to “explain” its result rankings. Recent work has emphasized how it can also be used for intranets; by local communities with richer data m...
Full Text Search of Web Archive Collections
The Internet Archive, in cooperation with the International Internet Preservation Consortium, is developing an open source full text search of Web archive collections. Web archive collection search presents the usual set of technical difficulties searching large collections of documents. It also introduces new challenges often at odds with typical search engine usage. This paper outlines the ch...
Design and Implementation of Agricultural Production and Market Information Matching Recommendation System in the Cloud Environment
At present, in China, the decentralized production of small farmers does not keep pace with the requirements of agricultural product market development. This paper provides a new approach, using cloud computing technology to design and implement an agricultural production and marketing information matching recommendation system in a cloud computing environment. The platform collects agricultural mark...
TREC Dynamic Domain: Polar Science
This paper outlines the creation of the Polar dataset within the TREC-Dynamic Domain track. The techniques used to create the Polar dataset fall into two basic categories: information extraction using Apache Tika and information retrieval using Apache Nutch. First, we expanded the parsing capabilities of Apache Tika, an open source framework for text and metadata extraction, to provide more sea...